tidyverse packages
Source: R for Data Science
A data frame is created by calling the data.frame() function with named vectors as input.
df <- data.frame(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = c(TRUE, FALSE, FALSE, TRUE)
)
df
  x y     z
1 1 a  TRUE
2 2 b FALSE
3 3 c FALSE
4 4 d  TRUE
# str() applied to a data frame is useful for determining variable types
str(df)
'data.frame':	4 obs. of  3 variables:
 $ x: int  1 2 3 4
 $ y: chr  "a" "b" "c" "d"
 $ z: logi  TRUE FALSE FALSE TRUE
# dim() behaves as it does for a matrix, returning the number of rows and columns, respectively
dim(df)
[1] 4 3
# Unlike a matrix, length() of a data frame returns the length of the underlying list (the number of columns)
length(df)
[1] 3
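Because a data frame is a list of equal-length columns, the dimension functions relate to each other in a simple way. A minimal sketch, assuming a toy data frame (the name `toy` is introduced here purely for illustration):

```r
# Toy data frame (hypothetical name, same structure as the example above)
toy <- data.frame(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = c(TRUE, FALSE, FALSE, TRUE)
)

# A data frame is a list of columns under the hood
is.list(toy)

# length() counts list elements, i.e. columns, so it matches ncol()
length(toy) == ncol(toy)

# dim() is simply c(nrow(), ncol())
identical(dim(toy), c(nrow(toy), ncol(toy)))
```

All three expressions should evaluate to TRUE.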
l <- list(x = 1:5, y = letters[1:5], z = rep(c(TRUE, FALSE), length.out = 5))
l
$x
[1] 1 2 3 4 5

$y
[1] "a" "b" "c" "d" "e"

$z
[1]  TRUE FALSE  TRUE FALSE  TRUE

df <- data.frame(l)
df
  x y     z
1 1 a  TRUE
2 2 b FALSE
3 3 c  TRUE
4 4 d FALSE
5 5 e  TRUE
str(df)
'data.frame':	5 obs. of  3 variables:
 $ x: int  1 2 3 4 5
 $ y: chr  "a" "b" "c" "d" ...
 $ z: logi  TRUE FALSE TRUE FALSE TRUE
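The list-to-data-frame conversion above works because all list elements have the same length. As a rough illustration of what happens otherwise (the object names `ok` and `res` are made up for this sketch), data.frame() recycles vectors whose length divides the longest column evenly and raises an error for the rest:

```r
# Length-1 and whole-multiple vectors are recycled to the longest column
ok <- data.frame(x = 1:4, y = c("a", "b"), z = TRUE)
dim(ok)

# Lengths that do not divide evenly raise an error, captured here as text
res <- tryCatch(
  data.frame(x = 1:4, y = 1:3),
  error = function(e) conditionMessage(e)
)
res
```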
# Subset like a list (columns by name)
df[c("x", "z")]
  x     z
1 1  TRUE
2 2 FALSE
3 3  TRUE
4 4 FALSE
5 5  TRUE
# Subset like a matrix (rows, columns)
df[, c("x", "z")]
  x     z
1 1  TRUE
2 2 FALSE
3 3  TRUE
4 4 FALSE
5 5  TRUE
df[df$y == "b", ]
  x y     z
2 2 b FALSE
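The two subsetting styles differ in what they return when a single column is selected. A small sketch with a made-up data frame `toy` that contrasts `[`, `[[`, `$` and the `drop` argument:

```r
toy <- data.frame(x = 1:5, y = letters[1:5])

# List-style single bracket keeps the data frame structure
class(toy["x"])

# Double bracket and $ extract the column itself as a vector
class(toy[["x"]])
identical(toy[["x"]], toy$x)

# Matrix-style selection of one column drops to a vector by default...
class(toy[, "x"])

# ...unless drop = FALSE is given
class(toy[, "x", drop = FALSE])
```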
rbind() (row bind) - appends a row to data frame
cbind() (column bind) - appends a column to data frame
rand <- rnorm(5)
rand
[1] -1.6395385 -0.6401171 1.4880066 -0.4978420 -1.3442429
df <- cbind(df, rand)
df
  x y     z       rand
1 1 a  TRUE -1.6395385
2 2 b FALSE -0.6401171
3 3 c  TRUE  1.4880066
4 4 d FALSE -0.4978420
5 5 e  TRUE -1.3442429
# Note that a row has to be a list as it contains different data types
r <- list(6, letters[6], FALSE, rnorm(1))
r
[[1]]
[1] 6

[[2]]
[1] "f"

[[3]]
[1] FALSE

[[4]]
[1] -0.2291225

df <- rbind(df, r)
df
  x y     z       rand
1 1 a  TRUE -1.6395385
2 2 b FALSE -0.6401171
3 3 c  TRUE  1.4880066
4 4 d FALSE -0.4978420
5 5 e  TRUE -1.3442429
6 6 f FALSE -0.2291225
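An unnamed list is matched to columns by position. One possible alternative, sketched here with hypothetical objects `toy` and `new_row`, is to bind a one-row data frame whose column names match, which makes the intended column for each value explicit:

```r
toy <- data.frame(x = 1:2, y = c("a", "b"), z = c(TRUE, FALSE))

# A one-row data frame with matching column names instead of a bare list
new_row <- data.frame(x = 3L, y = "c", z = TRUE)
toy <- rbind(toy, new_row)

# Column types are preserved after binding
sapply(toy, class)
```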
tibble from tibble package (part of tidyverse package ecosystem)
data.table from data.table package
tibble provides features enhancing user experience (readability, ease of manipulation)
data.table provides speed
dt <- data.table::data.table(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = c(TRUE, FALSE, FALSE, TRUE)
)
dt
  x y     z
1 1 a  TRUE
2 2 b FALSE
3 3 c FALSE
4 4 d  TRUE
tidyverse packages¶
tidyverse package ecosystem - rich collection of data science packages
readr - data input/output (also readxl for spreadsheets, haven for SPSS/Stata)
dplyr - data manipulation (also tidyr for pivoting)
ggplot2 - data visualisation
lubridate - working with dates and time
tibble - enhanced data frame
install.packages("tidyverse")
Tibbles are created with the tibble::tibble() function
Existing objects are converted with the tibble::as_tibble() function
tb <- tibble::tibble(
  x = 1:4,
  y = c("a", "b", "c", "d"),
  z = c(TRUE, FALSE, FALSE, TRUE)
)
tb
  x y     z
1 1 a  TRUE
2 2 b FALSE
3 3 c FALSE
4 4 d  TRUE
str(tb)
tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
 $ x: int [1:4] 1 2 3 4
 $ y: chr [1:4] "a" "b" "c" "d"
 $ z: logi [1:4] TRUE FALSE FALSE TRUE
dim(tb)
[1] 4 3
tb[c("x", "z")]
  x     z
1 1  TRUE
2 2 FALSE
3 3 FALSE
4 4  TRUE
tb[tb$y == "b", ]
  x y     z
1 2 b FALSE
# New columns can also be created/modified by assignment (if the RHS object has correct length)
tb["r"] <- rnorm(4)
tb
  x y     z           r
1 1 a  TRUE -0.63905096
2 2 b FALSE -0.40466580
3 3 c FALSE  0.49230918
4 4 d  TRUE  0.09646717
# Individual columns can also be selected with $ operator
tb$r <- tb$r + 5
tb
  x y     z        r
1 1 a  TRUE 4.360949
2 2 b FALSE 4.595334
3 3 c FALSE 5.492309
4 4 d  TRUE 5.096467
# names() attribute for data frames/tibbles contains column names
names(tb)
[1] "x" "y" "z" "r"
names(tb)[4] <- "rand"
tb
  x y     z     rand
1 1 a  TRUE 4.360949
2 2 b FALSE 4.595334
3 3 c FALSE 5.492309
4 4 d  TRUE 5.096467
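Tibbles mostly behave like data frames in the examples above, but they are stricter in a few places. A brief sketch of two such differences, using made-up objects `df_plain` and `tb_plain` and assuming the tibble package is installed:

```r
library(tibble)

df_plain <- data.frame(random = 1:3)
tb_plain <- tibble(random = 1:3)

# data.frame: $ partially matches column names
df_plain$ran

# tibble: no partial matching; an unknown column gives NULL (with a warning)
suppressWarnings(tb_plain$ran)

# data.frame: matrix-style selection of one column drops to a vector
class(df_plain[, "random"])

# tibble: [ always returns a tibble
class(tb_plain[, "random"])
```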
dplyr¶
dplyr is one of the core packages for data manipulation in tidyverse
Its principal functions are:
filter() - subset rows from data
mutate() - add new/modify existing variables
rename() - rename existing variable
select() - subset columns from data
arrange() - order data by some variable
For data summary:
group_by() - group data by some variable
summarise() - create a summary of grouped variables
library("dplyr")
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
dplyr examples¶
dplyr::filter(tb, y == 'b', z == FALSE)
  x y     z     rand
1 2 b FALSE 4.595334
# Note that dplyr functions do not require quoted variable names
dplyr::select(tb, x, z)
  x     z
1 1  TRUE
2 2 FALSE
3 3 FALSE
4 4  TRUE
# We can also use helpful tidyselect functions for more complex rules
dplyr::select(tb, tidyselect::starts_with('r'))
      rand
1 4.360949
2 4.595334
3 5.492309
4 5.096467
dplyr examples continued¶
# Data is not modified in-place, you need to re-assign the results
tb <- dplyr::rename(tb, random = rand)
dplyr::mutate(tb, random_8plus = ifelse(random >= 8, TRUE, FALSE))
  x y     z   random random_8plus
1 1 a  TRUE 4.360949        FALSE
2 2 b FALSE 4.595334        FALSE
3 3 c FALSE 5.492309        FALSE
4 4 d  TRUE 5.096467        FALSE
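The examples above cover filter(), select(), rename() and mutate(). The remaining functions from the list (arrange(), group_by() and summarise()) could be used along these lines. This is a sketch on hypothetical toy data (`scores` is not part of the original examples):

```r
library(dplyr)

# Hypothetical toy data, introduced only for this illustration
scores <- tibble::tibble(
  group = c("a", "b", "a", "b", "a"),
  value = c(10, 20, 30, 40, 50)
)

# arrange() orders rows; desc() reverses the order
arrange(scores, desc(value))

# group_by() followed by summarise() gives one row per group
by_group <- summarise(
  group_by(scores, group),
  n = n(),
  mean_value = mean(value)
)
by_group
```

Here `n()` counts the rows in each group, so `by_group` should contain two rows, one for "a" and one for "b".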
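%>% operator¶
Users of tidyverse packages are encouraged to use the pipe operator %>%
Base R also has a built-in pipe operator |>, but it is still relatively uncommon
<result> <- <input> %>%
  <function_name>(., arg_1, arg_2, ..., arg_n)
The dot placeholder can be omitted, as the input is passed as the first argument by default:
<result> <- <input> %>%
  <function_name>(arg_1, arg_2, ..., arg_n)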
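To make the template concrete, here is a small sketch assuming the magrittr package (which provides %>%) is installed. The variable names are illustrative only. It shows that a piped call matches the nested call, and that an explicit dot sends the input to an argument other than the first:

```r
library(magrittr)

x <- c(4, 9, 16)

# Piped and nested forms compute the same thing
piped  <- x %>% sqrt() %>% sum()
nested <- sum(sqrt(x))
identical(piped, nested)

# With an explicit dot the input goes where the dot is, not to the first argument
steps <- 2 %>% seq(1, 10, by = .)
steps
```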
%>% operator examples¶
tb
  x y     z   random
1 1 a  TRUE 4.360949
2 2 b FALSE 4.595334
3 3 c FALSE 5.492309
4 4 d  TRUE 5.096467
tb <- tb %>%
  dplyr::mutate(random_2 = rnorm(4)) %>%
  dplyr::filter(z == FALSE)
tb
  x y     z   random   random_2
1 2 b FALSE 4.595334 -0.9099916
2 3 c FALSE 5.492309 -0.1632015
%>% operator vs built-in |> operator¶
The built-in |> pipe operator has been available since R 4.1.0
# Pipe %>% can also be used with non-dplyr functions
tb$x %>% .[2]
[1] 3
# Base R pipe operator |> is more restrictive (e.g. tb$x |> `[`(2) doesn't work)
tb |> nrow()
[1] 2
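One way around the restriction on `[` with the base pipe is to wrap the extraction in an anonymous function. A minimal sketch, assuming R 4.1.0 or later and using a stand-in vector `v` in place of tb$x:

```r
# Stand-in for tb$x after the filtering step above
v <- c(2L, 3L)

# Wrapping the extraction in an anonymous function makes it a valid right-hand side
second <- v |> (\(x) x[2])()
second

# Ordinary function calls work as usual; the input becomes the first argument
last_first <- v |> rev() |> head(1)
last_first
```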
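Pivoting data between long and wide formats:
long to wide (tidyr::pivot_wider())
wide to long (tidyr::pivot_longer())
[Figures: pivot_wider() and pivot_longer() illustrations. Source: R for Data Science]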
tb2 <- tibble::tibble(
country = c("Afghanistan", "Brazil"),
`1999` = c(745, 2666),
`2000` = c(37737, 80488)
)
tb2
  country     1999  2000
1 Afghanistan  745 37737
2 Brazil      2666 80488
tb2 <- tb2 %>%
  # Note that pivoting functions come from the 'tidyr' package
  tidyr::pivot_longer(cols = c("1999", "2000"), names_to = "year", values_to = "cases")
tb2
  country     year cases
1 Afghanistan 1999   745
2 Afghanistan 2000 37737
3 Brazil      1999  2666
4 Brazil      2000 80488
tb2 <- tb2 %>%
  tidyr::pivot_wider(names_from = "year", values_from = "cases")
tb2
  country     1999  2000
1 Afghanistan  745 37737
2 Brazil      2666 80488
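In the round trip above, `year` ends up as a character column because it is built from column names. If a numeric year is wanted, one option is the `names_transform` argument of pivot_longer(). This is a sketch assuming a reasonably recent tidyr, with `wide`, `long` and `back` as illustrative names:

```r
library(tidyr)

wide <- tibble::tibble(
  country = c("Afghanistan", "Brazil"),
  `1999` = c(745, 2666),
  `2000` = c(37737, 80488)
)

# Column names become character values by default; names_transform converts them
long <- pivot_longer(
  wide,
  cols = c("1999", "2000"),
  names_to = "year",
  values_to = "cases",
  names_transform = list(year = as.integer)
)
class(long$year)

# Pivoting back restores the original column layout
back <- pivot_wider(long, names_from = "year", values_from = "cases")
names(back)
```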
.csv (Comma-separated value) files for storing tabular data
.rds (R data serialization) files allow to store a single R object (similar to Python's pickle)
.rda (R data) files for saving and loading multiple R objects
.feather/.parquet - big data formats associated with Apache Hadoop ecosystem
.csv (Comma-separated value)
read.csv()/write.csv() - base R functions
readr::read_csv()/readr::write_csv() - functions from readr package in tidyverse
.rds (R data serialization)
readRDS()/saveRDS() - base R functions
readr::read_rds()/readr::write_rds() - functions from readr (no default compression)
.rda (R data)
save()/load() - base R functions
.feather/.parquet
arrow::read_feather()/arrow::write_feather() and arrow::read_parquet()/arrow::write_parquet() - functions from arrow package in Apache Arrow
# We are skipping the first row as this dataset has a composite header of 2 rows (variable name, question)
kaggle2021 <- readr::read_csv('../data/kaggle_survey_2021_responses.csv', skip = 1)
Rows: 25973 Columns: 369
── Column specification ────────────────────────────────────────────────
Delimiter: ","
chr (360): What is your age (# years)?, What is your gender? - Selected Choi...
dbl   (1): Duration (in seconds)
lgl   (8): In the next 2 years, do you hope to become more familiar with any...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(kaggle2021[,1:10])
| Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | What is the highest level of formal education that you have attained or plan to attain within the next 2 years? | Select the title most similar to your current role (or most recent title if retired): - Selected Choice | For how many years have you been writing code and/or programming? | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R | What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL |
|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| 910 | 50-54 | Man | India | Bachelor’s degree | Other | 5-10 years | Python | R | NA |
| 784 | 50-54 | Man | Indonesia | Master’s degree | Program/Project Manager | 20+ years | NA | NA | SQL |
| 924 | 22-24 | Man | Pakistan | Master’s degree | Software Engineer | 1-3 years | Python | NA | NA |
| 575 | 45-49 | Man | Mexico | Doctoral degree | Research Scientist | 20+ years | Python | NA | NA |
| 781 | 45-49 | Man | India | Doctoral degree | Other | < 1 years | Python | NA | NA |
| 1020 | 25-29 | Woman | India | I prefer not to answer | Currently not employed | < 1 years | Python | NA | NA |
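The base R readers and writers for the formats listed earlier can be tried without any external data. The following is only a sketch on a made-up data frame `toy`, writing to temporary files rather than the course data directory:

```r
toy <- data.frame(x = 1:3, y = c("a", "b", "c"))

# .rds: a single object, restored exactly as it was saved
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
from_rds <- readRDS(rds_path)
identical(toy, from_rds)

# .csv: plain text; row.names = FALSE avoids writing an extra index column
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)
str(from_csv)

# .rda: several named objects, restored into the current environment by load()
rda_path <- tempfile(fileext = ".rda")
a <- 1:3
b <- "text"
save(a, b, file = rda_path)
rm(a, b)
loaded <- load(rda_path)
loaded
```

Note that readRDS() returns the object, so it can be assigned to any name, while load() recreates the objects under their original names and returns those names.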
# Note that summary(), unlike pandas' describe(), summarises all variable types by default
summary(kaggle2021[,1:10])
Duration (in seconds) What is your age (# years)?
Min. : 120 Length:25973
1st Qu.: 443 Class :character
Median : 656 Mode :character
Mean : 11055
3rd Qu.: 1038
Max. :2488653
What is your gender? - Selected Choice
Length:25973
Class :character
Mode :character
In which country do you currently reside?
Length:25973
Class :character
Mode :character
What is the highest level of formal education that you have attained or plan to attain within the next 2 years?
Length:25973
Class :character
Mode :character
Select the title most similar to your current role (or most recent title if retired): - Selected Choice
Length:25973
Class :character
Mode :character
For how many years have you been writing code and/or programming?
Length:25973
Class :character
Mode :character
What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python
Length:25973
Class :character
Mode :character
What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R
Length:25973
Class :character
Mode :character
What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL
Length:25973
Class :character
Mode :character
# table() is flexible: it can tabulate a single variable or cross-tabulate several
table(kaggle2021[3])
Man Nonbinary Prefer not to say
20598 88 355
Prefer to self-describe Woman
42 4890
# Wrapping it inside prop.table() gives proportions of each category
prop.table(table(kaggle2021[3]))
Man Nonbinary Prefer not to say
0.793054326 0.003388134 0.013668040
Prefer to self-describe Woman
0.001617064 0.188272437
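The single-variable tables above extend naturally to cross-tabulations. A minimal sketch on a hypothetical toy survey (the data frame `survey` and its values are invented for illustration and are not taken from the Kaggle data):

```r
# Hypothetical toy survey, not taken from the Kaggle data
survey <- data.frame(
  gender = c("Man", "Woman", "Man", "Woman", "Man"),
  uses_r = c("yes", "yes", "no", "no", "no")
)

# Passing two variables gives a cross-tabulation
xt <- table(survey$gender, survey$uses_r)
xt

# margin = 1 turns counts into row proportions (each row sums to 1)
row_props <- prop.table(xt, margin = 1)
row_props
```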
# Wrapping it inside sort() sorts by value, as opposed to alphabetically (or by factor levels)
sort(table(kaggle2021[3]), decreasing = TRUE)[1]